Abstract

Background: Protein structures are flexible and often show conformational changes upon binding to other 
molecules to exert biological functions. As protein structures correlate with characteristic functions, structure 
comparison allows classification and prediction of proteins of undefined functions. However, most comparison 
methods treat proteins as rigid bodies and cannot retrieve similarities of proteins with large conformational 
changes effectively. 

Results: In this paper, we propose a novel descriptor, local average distance (LAD), based on either the geodesic 
distances (GDs) or Euclidean distances (EDs) for pairwise flexible protein structure comparison. The proposed 
method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising
hinge bending motions from the MolMovDB, and the results show that our method outperformed all other
methods in retrieving similar structures in terms of precision-recall curve, retrieval success rate, R-precision,
mean average precision and F1-measure.

Conclusions: Both ED- and GD-based LAD descriptors are effective for searching deformed structures and overcome
the self-connection problems caused by large bending motions. We have also demonstrated that the ED-based
LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for
searching structure databases, discovering previously unknown conformational relationships, and reorganizing protein
structure classification.



Background 

Protein structure comparison plays an important role in 
predicting functions of novel proteins [1] and several 
methods have been developed for pairwise [2-8] and multiple
[9-16] comparisons. Most existing methods of structure
comparison treat proteins as rigid bodies; however,
protein structures are flexible and change conformation
in response to binding other molecules, which underlies
biological functions such as immune protection,
enzymatic catalysis and cellular locomotion [17,18]. Such
structural variations render rigid-body algorithms unable
to generate correct alignments or retrieve similar structures
with large deformations. Therefore, the flexibility of proteins
should be taken into account when comparing structures
and searching for similarities to a query structure.

Alignment methods

Flexible structure comparison has received much attention in recent years. For instance,
FlexProt finds congruent rigid fragment pairs between two proteins together with the flexible
regions (hinges), and then performs a clustering procedure to join consecutive fragment pairs
into congruent domain pairs [19,20]. FATCAT connects aligned fragment pairs using a dynamic
programming algorithm that introduces penalty scores for gaps and twists between consecutive
aligned fragment pairs [21]. Compared with FlexProt, FATCAT generates alignments with fewer
twists but similar root mean square deviations (RMSDs) and lengths. The TOPS++FATCAT
algorithm reduces the number of aligned fragment pairs during FATCAT comparison by applying
topological constraints obtained from the TOPS+ alignment of secondary structure elements
(SSEs) [22]. As a result, TOPS++FATCAT is more than 10 times faster than FATCAT. Both
FlexProt and FATCAT are sequential alignment algorithms and are thus unable to identify
non-sequential alignments. FASE [23] and FlexSnap [24] were designed to tackle the problem of
non-sequential flexible structure alignment. FASE compares structures starting from aligned
pairs of SSEs, under the assumption that an optimal superposition of a pair of structures must
have at least one pair of well-aligned SSEs. FlexSnap applies a greedy algorithm for connecting
aligned fragment pairs and achieves results competitive with other state-of-the-art pairwise
comparison methods. Matt, one of the most popular and accurate flexible multiple structure
alignment methods, is also based on chaining aligned fragment pairs, which are allowed
translations and rotations during assembly [25,26].

Non-alignment methods 

The alignment/superposition-based comparison methods are too slow for searching a structure
database for similar structures in real time [27]. Therefore, several non-alignment approaches
based on different descriptors of molecular shape have been proposed. These descriptors are
usually represented by histograms or vectors, and a similarity score between two molecules is
calculated from the corresponding descriptors without any alignment [28,29]. For example,
Daras et al. applied the spherical trace transform to produce rotation-invariant descriptor
vectors, composed of weighted geometry- and attribute-based vectors, for protein
classification [30]. The 3D Zernike descriptor represents a protein structure by 121 numbers
based on a series expansion of 3D functions for fast retrieval of similar structures, and was
shown to be applicable to low-resolution structures as well [27,31]. Abu Deeb et al. proposed a
global descriptor of the protein surface, constructed from local patch descriptors defined by
residue-specific distance distributions between Cα atoms and the numbers of pairwise residue
co-occurrences within each surface patch [32]. Yin et al. compared local protein surfaces using
geometric fingerprints of each surface patch [33]. A fingerprint consists of 60 (4 by 15) bins
corresponding to the geodesic-distance-dependent distribution of curvatures.

Nevertheless, most non-alignment methods treat proteins as rigid bodies and neglect the
flexibility of protein conformations required for performing biological functions. To address
flexibility, Liu and Fang et al. proposed several histogram-based descriptors for comparing
flexible molecules: a local diameter descriptor depicting the local characteristics of boundary
points [34], and an inner distance descriptor defined as the shortest path between landmark
points [28,35]. Both methods are sensitive to self-connection problems during molecular shape
deformation. Accordingly, an improved method named Diffusion Distance Shape Descriptor (DDSD)
was proposed, which is based on an average distance instead of the shortest distance between
two landmark points [36]. Although DDSD is superior to local diameter, inner distance and other
descriptors in retrieving similar protein structures, its performance is still unsatisfactory,
with an F1-measure of 37.04%.

Proposed method 

Non-alignment or descriptor-based approaches are generally fast enough to search a large
database in real time, but they do not provide residue correspondences, which can be crucial
information for biologists. Combining the ideas of alignment- and descriptor-based approaches,
we propose a novel and efficient descriptor called local average distance (LAD), which is based
on either geodesic distances (GDs) or Euclidean distances (EDs), for pairwise flexible protein
structure comparison. Each protein structure is first transformed into its corresponding LAD
profile, and the similarity between two proteins is calculated from a pairwise local alignment
of the transformed profiles. The Hinge Atlas and Hinge Atlas Gold datasets [37] from the
MolMovDB [38] were employed to evaluate the performance of the proposed LAD descriptors and to
compare them with several non-alignment and rigid/flexible structure alignment methods.

Methods 

The proposed protein structure comparison algorithm is 
based on the LAD profile which is built from pairwise 
residue distances (ED or GD) within a protein. The 
workflow of generating profiles from atomic coordinates 
of proteins is shown in Figure 1. The similarity between
two proteins is determined by a local pairwise alignment
of their corresponding LAD profiles. The core procedures
can be decomposed into triangular surface construction,
surface simplification, ED/GD calculation, profile
construction and profile comparison. Details of each step
are introduced in the following sections.

Triangular surface construction and simplification 

The solvent-accessible surface (SAS) [39] and solvent- 
excluded surface [40,41] (SES, also known as molecular 
surface or Connolly surface) are the most widely used 
definitions for protein surface analysis. Each atom of a 
protein is represented as a sphere with its van der Waals 
radius. The SAS is traced out by the center of a solvent 
probe sphere rolling over the spherical atoms, whereas 
the SES is formed by the inward-facing surface of the 
probe consisting of contact surface and re-entrant surface. 
For a more complete description of both SAS and SES, please refer to [42]. Many algorithms have
been developed to build the SAS and/or SES, based on, for example, the Gauss-Bonnet theorem
[43], level sets [44], alpha shapes [45,46], beta shapes [47], the Euclidean distance transform
[48], ray-casting [49] and others [50-52]. One common area-based method defines a residue as a
surface residue if its surface area is greater than a specific threshold [46,53]. Other
area-based methods consider a residue with relative solvent accessibility larger than a
threshold as a surface residue [54,55]. The relative solvent accessibility is defined as a
residue's solvent-accessible area divided by the maximum area of that residue [56,57]. In
recent years, novel atom-depth-based approaches have been proposed as alternative ways to
define surface residues [58,59]. Different algorithms employ various definitions of atom depth,
which can be defined as the distance of an atom from the nearest water molecule surrounding the
protein, from the molecular surface, or from its closest solvent-accessible neighbor [60].

Figure 1 Flowchart of LAD profile generation for the protein 5rsa:A. (a) Yellow ribbons display the secondary structures and red spheres
represent the backbone atoms (N, Cα, C, O) of each residue. (b) The MSMS-generated triangular surface. (c) The simplified surface
created by MeshLab. (d) and (e) are LAD profiles of the input protein for ED (LADed) and GD (LADgd) respectively.

The input for building an LAD profile is a standard PDB file. Because GD calculation requires
a triangular surface mesh, one of the fastest and most widely used surface programs, MSMS
v2.6.1 [61], is applied to construct triangular surface meshes from the coordinates of all
backbone atoms of the protein (Figure 1a). All MSMS parameters are left at their default
settings. This tool usually generates high-resolution meshes (Figure 1b) for proteins; however,
calculating GDs among the vertices of such meshes is time-consuming and memory-intensive. To
reduce the resolution of MSMS-generated meshes, an open source tool, MeshLab v1.3.2
(http://meshlab.sourceforge.net/), is adopted to downsample the original meshes. The outputs of
MSMS are converted into Polygon File Format (Stanford Triangle Format) as MeshLab's inputs. The
Quadric Edge Collapse algorithm, a variant of the well-known quadric error metric algorithm
[62], is employed to simplify meshes (Figure 1c). As a result, the number of faces of each
MSMS-generated mesh was generally reduced by 85% in this research.

Calculation of pairwise residue distances 

The simplified meshes are then used to identify surface residues, from which the GDs and EDs
of surface residue pairs are obtained. Each vertex of a simplified mesh is assigned to the
closest backbone atom of the protein; in other words, an atom can possess more than one
vertex. We define the vertices assigned to an atom as the associated vertices of that atom. A
residue is regarded as a surface residue if its backbone atoms have at least one associated
vertex.
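
As a rough illustration of the vertex assignment described above, the following sketch assigns each mesh vertex to its nearest backbone atom and flags residues that own at least one vertex. The data layout (coordinate tuples, `(residue_id, atom_name)` keys) and the function names are assumptions made for this example and are not taken from the authors' implementation.

```python
import math
from collections import defaultdict


def associate_vertices(vertices, atoms):
    """Assign every mesh vertex to its closest backbone atom.

    vertices: list of (x, y, z) tuples from the simplified mesh.
    atoms: dict mapping (residue_id, atom_name) -> (x, y, z); hypothetical layout.
    Returns a dict mapping atom key -> list of associated vertex indices.
    """
    associated = defaultdict(list)
    for idx, vertex in enumerate(vertices):
        nearest = min(atoms, key=lambda key: math.dist(vertex, atoms[key]))
        associated[nearest].append(idx)
    return dict(associated)


def surface_residues(associated):
    """A residue counts as a surface residue if any of its atoms owns a vertex."""
    return sorted({residue_id for (residue_id, _atom_name) in associated})


# Toy example: residue 3 lies between the others and receives no vertex.
atoms = {(1, "CA"): (0.0, 0.0, 0.0), (2, "CA"): (10.0, 0.0, 0.0), (3, "CA"): (5.0, 0.0, 0.0)}
vertices = [(0.0, 1.0, 0.0), (9.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
owners = associate_vertices(vertices, atoms)
print(surface_residues(owners))
```

The brute-force nearest-atom search is quadratic in mesh and protein size; a spatial index would be the usual choice for large meshes, but that is outside what the text specifies.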

GD is the length of the shortest path along the surface from a source point to a destination
point. We adopted the open source program published by Danil Kirsanov
(http://code.google.com/p/geodesic/) to calculate GDs between any two vertices of the
simplified meshes. The GD between two atoms, a_i and a_j, is defined as the average of the GDs
between all pairs of their associated vertices:

$$GD(a_i, a_j) = \frac{1}{M \times N} \sum_{x=1}^{M} \sum_{y=1}^{N} GD\left(v_i^x, v_j^y\right)$$

where GD(a_i, a_j) is the average GD from the i-th atom to the j-th atom, and v_i^x and v_j^y
represent the x-th vertex of the i-th atom and the y-th vertex of the j-th atom respectively.
The symbols M and N indicate the numbers of vertices associated with the i-th atom and the j-th
atom, and GD(v_i^x, v_j^y) is the GD from vertex v_i^x to vertex v_j^y. Atoms with no
associated vertices are not considered; hence M and N must be strictly larger than zero. In
contrast to the GD, the ED between two atoms is obtained directly from their coordinates. Once
the two distance measures between any two atoms are obtained, the distance measures between
any two residues are calculated similarly, by averaging the GDs or EDs over all associated
backbone atoms.
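
The averaging used for both atom-level GDs and residue-level distances can be sketched with one helper that takes an arbitrary pairwise distance function. This is an illustrative sketch only: the function names are hypothetical, and the geodesic lookup is assumed to come from an external solver (here mocked with a small table).

```python
import math
from itertools import product


def mean_pairwise_distance(group_a, group_b, dist):
    """Average of dist(p, q) over all pairs with p in group_a and q in group_b."""
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty (M, N > 0)")
    total = sum(dist(p, q) for p, q in product(group_a, group_b))
    return total / (len(group_a) * len(group_b))


def residue_ed(backbone_a, backbone_b):
    """Residue-level ED: mean Euclidean distance over all backbone atom pairs."""
    return mean_pairwise_distance(backbone_a, backbone_b, math.dist)


# Atom-level GD from a mocked vertex-to-vertex geodesic table (assumed input).
vertex_gd = {(0, 2): 5.0, (1, 2): 7.0}
atom_gd = mean_pairwise_distance([0, 1], [2], lambda p, q: vertex_gd[(p, q)])
print(atom_gd, residue_ed([(0, 0, 0), (0, 0, 2)], [(0, 0, 4)]))
```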

Construction of LAD profiles 

LAD is proposed to retain the local characteristics of each residue in sequence order. The LAD
profile of a protein consists of average distance values obtained by scanning a sliding window
from the N- to the C-terminus. In this study, we tried odd window sizes ranging from 3 to 21,
and a window size of 9 residues provided the best performance on the training dataset (Dataset
L from ADiDoS [63]). Hence, a window size of 9 is applied to build all LAD profiles. We
implemented two types of LAD profile: one based on the ED feature (LADed, Figure 1d) and the
other based on the GD feature (LADgd, Figure 1e). Given a residue at position i (residue_i) in
the sequence, LAD_i for residue_i is defined as the average distance from residue_i to the
neighbouring residues on both sides within the window.
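
A minimal sketch of the sliding-window construction is given below. It assumes a precomputed residue-by-residue distance matrix (ED or GD) covering every residue in the profile, and it assumes that the window is simply truncated near the termini; the text does not state how chain ends are handled, so that detail is a guess.

```python
def lad_profile(dist, window=9):
    """Build an LAD profile from a symmetric residue distance matrix.

    dist[i][j] is the ED or GD between residues i and j in sequence order.
    Assumption: near the termini the window is truncated to existing residues.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    n = len(dist)
    if n < 2:
        raise ValueError("at least two residues are required")
    half = window // 2
    profile = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbours = [j for j in range(lo, hi) if j != i]
        profile.append(sum(dist[i][j] for j in neighbours) / len(neighbours))
    return profile


# Toy chain: five residues spaced one unit apart on a straight line.
line = [[abs(i - j) for j in range(5)] for i in range(5)]
print(lad_profile(line, window=5))
```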

LAD diversity 

The pairwise structure comparison in this study is based on evaluating the similarity of two
LAD profiles from two individual proteins. A variation of the Smith-Waterman algorithm is
performed to obtain the correspondence of residues between two proteins by comparing LADs
instead of amino acid contents. For dynamic programming, the similarity score between two
residues, residue_i and residue_j, is inversely proportional to the absolute difference between
LAD_i and LAD_j.
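
The following sketch shows one way a Smith-Waterman style local alignment could operate on two LAD profiles. The exact similarity function and gap penalties are not given in the text, so the linear score `offset - |LAD_i - LAD_j|` and the linear gap penalty used here are stand-in assumptions, not the published scoring scheme.

```python
def smith_waterman_lad(p, q, offset=1.0, gap=1.0):
    """Local alignment of two LAD profiles (lists of floats).

    Assumed scoring: s(i, j) = offset - |p[i] - q[j]|, with a linear gap penalty.
    Returns (best_score, list of aligned (i, j) index pairs).
    """
    n, m = len(p), len(q)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = offset - abs(p[i - 1] - q[j - 1])
            h[i][j] = max(0.0, h[i - 1][j - 1] + s, h[i - 1][j] - gap, h[i][j - 1] - gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = offset - abs(p[i - 1] - q[j - 1])
        if h[i][j] == h[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


score, aligned = smith_waterman_lad([5.0, 6.0, 7.0, 6.5], [9.0, 5.0, 6.0, 7.0, 6.5, 9.0])
print(score, aligned)
```

The aligned index pairs give the residue correspondence, from which the number of aligned residues and the RMSD of LAD values can be derived.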

The similarity of two proteins is quantified from the result of the pairwise profile alignment.
A novel scoring function named LAD diversity (LADdiv) is proposed, which considers the number
of equivalent (aligned) residues (Ne) and the root-mean-square deviation (RMSD) of LADs for
aligned residues. LADdiv is defined in the following equation, where NQ and NS are the lengths
of the query and the subject proteins respectively. The symbols D and a are used to adjust the
effect of RMSD on LADdiv. Since Ne must be less than or equal to NQ and NS, the value of LADdiv
is between 0 and 1, and smaller values represent higher similarities.

$$LAD_{div} = 1 - \frac{N_e}{\mathrm{mean}(N_Q, N_S)\left[1 + \left(\frac{RMSD}{D}\right)^{a}\right]}$$
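
A minimal sketch of the LADdiv score, written in Python for illustration only. The function and argument names are assumptions chosen for this example, and the protein lengths in the usage line are hypothetical.

```python
def lad_div(n_e, n_q, n_s, rmsd, d, a):
    """LAD diversity: 1 - Ne / (mean(NQ, NS) * (1 + (RMSD / D) ** a)).

    n_e: number of aligned residues; n_q, n_s: query and subject lengths;
    rmsd: RMSD of LAD values over aligned residues; d, a: tuning parameters.
    """
    mean_len = (n_q + n_s) / 2.0
    return 1.0 - n_e / (mean_len * (1.0 + (rmsd / d) ** a))


# Hypothetical fully aligned pair of equal length with a small LAD RMSD.
print(round(lad_div(n_e=124, n_q=124, n_s=124, rmsd=0.173, d=1.0, a=4.5), 4))
```

A perfect, full-length alignment (RMSD of 0) gives 0, and an empty alignment gives 1, matching the stated range.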



Profile alignment of a similar structure pair tends to yield a low RMSD and a large Ne, and
therefore results in a low LADdiv. For example, the domain swapping protein pair illustrated in
the section on the self-connection problem has (RMSD, LADdiv) values of (0.173, 0.0004) and
(0.454, 0.02) for LADed and LADgd respectively. Conversely, a dissimilar structure pair has a
high LADdiv, with a large RMSD and a low Ne simultaneously. Figure 2 shows an instance of
profile alignment for a non-homologous protein pair with different conformations; accordingly,
the LADed profiles yielded high (RMSD, LADdiv) values of (1.601, 0.955) compared to the
previous example.
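
As a plausibility check of these numbers, assume (hypothetically) that every residue of the domain swapping pair is aligned, so that the ratio of aligned residues to the mean length is 1, and take the LADed parameters (D, a) = (1, 4.5) reported in this section:

$$LAD_{div} = 1 - \frac{1}{1 + (0.173/1)^{4.5}} \approx 1 - \frac{1}{1 + 3.7 \times 10^{-4}} \approx 0.0004$$

which agrees with the reported value. For the non-homologous pair, the RMSD term alone gives

$$1 - \frac{1}{1 + 1.601^{4.5}} \approx 1 - \frac{1}{9.31} \approx 0.89,$$

so the reported 0.955 would correspond to an aligned fraction of roughly 0.42 of the mean length. That fraction is inferred here from the formula and is not stated in the text.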

Figure 2 An example of LADed profile alignment for a non-homologous protein pair. Because the conformations of both structures are
significantly different, their LADed profiles could not be aligned properly, which resulted in a high LADdiv. (a) Cartoon representation of
the non-homologous protein pair 1wwa:X (green) and 1k0y:C (blue). (b) The pairwise alignment result of the LADed profiles for both structures.
The x-axis represents the serial numbers of residue pairs and the y-axis denotes LAD values.

The variables D and a were trained on Dataset L [63], which contains 706 known domain swapping
homologous pairs (Lds), 487 common homologous pairs (Lch) and 640 non-homologous pairs (Lnh) of
protein structures. Both Lds and Lch were considered as a positive dataset in which each pair
was expected to have a low LADdiv value. Conversely, Lnh was considered as a negative dataset
in which each pair was expected to have a high LADdiv value. Let Lds<0.5 and Lch<0.5 denote the
numbers of pairs whose LADdiv is less than 0.5 in Lds and Lch respectively, and let Lnh≥0.5
denote the number of pairs in Lnh whose LADdiv is larger than or equal to 0.5. We evaluated D
from 0.1 to 20 with an interval of 0.1, and a from 1 to 5 with an interval of 0.5. Hence, a
total of 1800 (200 × 9) combinations of D and a were evaluated, and the one with the maximum
Lds<0.5 + Lch<0.5 + Lnh≥0.5 was selected. Finally, (D, a) = (1, 4.5) and (D, a) = (1.1, 5) were
selected for LADed and LADgd respectively.
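
The parameter search over D and a described in this section amounts to an exhaustive grid search. The sketch below illustrates the idea under stated assumptions: each training pair is summarized by a hypothetical `(n_e, n_q, n_s, rmsd)` tuple taken from its profile alignment, and ties are broken by keeping the first best combination, which the text does not specify.

```python
def lad_div(n_e, n_q, n_s, rmsd, d, a):
    """LAD diversity: 1 - Ne / (mean(NQ, NS) * (1 + (RMSD / D) ** a))."""
    return 1.0 - n_e / (((n_q + n_s) / 2.0) * (1.0 + (rmsd / d) ** a))


def grid_search(positives, negatives, cutoff=0.5):
    """Pick the (D, a) pair that classifies the most training pairs correctly.

    positives / negatives: lists of (n_e, n_q, n_s, rmsd) tuples (assumed format).
    Returns (best_d, best_a, number_of_correct_pairs).
    """
    d_values = [round(0.1 * k, 1) for k in range(1, 201)]  # 0.1, 0.2, ..., 20.0
    a_values = [1.0 + 0.5 * k for k in range(9)]           # 1.0, 1.5, ..., 5.0
    best = None
    for d in d_values:
        for a in a_values:
            correct = sum(lad_div(*pair, d, a) < cutoff for pair in positives)
            correct += sum(lad_div(*pair, d, a) >= cutoff for pair in negatives)
            if best is None or correct > best[2]:
                best = (d, a, correct)
    return best


# Toy data: one similar pair (low RMSD) and one dissimilar pair (high RMSD).
print(grid_search([(100, 100, 100, 0.5)], [(100, 100, 100, 2.0)]))
```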



Structural diversity 

There are many different ways to measure the structural
similarity of aligned proteins, and many of them have
been reviewed in [1]. According to our previous research
[63], the structural diversity (Structdiv) [64] showed superior
performance in distinguishing homologous proteins from
non-homologous ones across various structural comparison
methods. Therefore, Structdiv was employed in this study
to compare existing rigid/flexible structural alignment tools
with our proposed method. Structdiv is defined as:



Structdiv 



RMSD 



:i{Nq,Ns) 



where RMSD is the root mean square deviation of the
distances between the aligned Cα atoms. As with LADdiv,
structural alignment of a similar structure pair tends to
have both a low RMSD and a large Ne, and thus a low Structdiv.

Testing datasets 

Two testing datasets were used in this research to validate our method and compare it with
existing methods. The first is the Hinge Atlas dataset, which contains 2791 protein structures
from 214 non-redundant morphs exhibiting hinge bending motions. The lengths of the proteins
range from 28 to 994 residues. A morph is a group of structures (9 to 32) comprising two
homologous proteins with different conformations and several interpolated structures between
these two initial structures. About 97% of the morphs in the dataset possess three or fewer
hinge points. Figure 3 shows an example of a morph with a large conformational change for the
protein GroEL, which contains 524 residues. Neither the LADed nor the LADgd descriptor is
sensitive to the deformation, LADed in particular. The second dataset, provided by Liu et al.,
is a subset of the Hinge Atlas [37] and Hinge Atlas Gold datasets and was used in the previous
study [36]. Liu's dataset contains 382 protein structures in 27 groups with large degrees of
conformational change.

Results 

Comparison with structural alignment methods 

Wang et al. BMC Bioinformatics 2014, 15:95
http://www.biomedcentral.com/1471-2105/15/95

Figure 3 A morph for the protein GroEL in the Hinge Atlas dataset. There are 20 structures in this morph (morph id is 805511-5128)
containing 4 hinge residues: 191G, 192M, 372A and 373G. (a) and (d) are the first and last proteins in the morphing group respectively, (b) is
the 7th interpolated structure and (c) is the 14th structure. (e) and (f) represent the LADed and LADgd profiles for the four structures of the
same protein; both profiles are insensitive to the conformational changes. The x-axis represents the serial numbers of residues and the y-axis
denotes LAD values. Figures (a)-(d) were generated by PyMOL (http://www.pymol.org/), and (e)-(f) by Highcharts (http://www.highcharts.com/).

LAD descriptors were compared with 2 rigid and 5 flexible structural alignment methods on the
Hinge Atlas dataset in terms of retrieving similar structures belonging to the same group
(morph) as the query structure. The first structure in each group was regarded as the
representative of that group, and the remaining 2577 proteins were considered as query
structures. Each query protein was compared with the 214 representatives, giving a total of
551478 (2577 × 214) pairwise comparisons. The results for each query were sorted according to
the diversity scores (LADdiv or Structdiv), and a retrieval was regarded as successful if the
representative belonging to the same group as the query protein was ranked first. The retrieval
performance of LAD and the other structural alignment methods on the Hinge Atlas dataset is
summarized in Table 1. The results show that LADed and LADgd performed better than the other
methods and achieved retrieval success rates of 97.1% and 95.0% respectively. The structural
alignment methods sometimes generated unsatisfactory alignments even when the relevant
structure was successfully ranked first. For example, for the query structure ff9 from the
morph group va2eznA-1l5bA, a domain-swapped dimer of Cyanovirin-N (Figure 4a), all methods
ranked the relevant structure ff0 at the top position. In this case, LADed, LADgd, FlexProt,
FlexSnap and jFATCAT (Figure 4b) could align the protein pair completely, but FASE (Figure 4c),
FAST (Figure 4d), Matt-Rigid (Figure 4e) and Matt-Flexible (Figure 4f) aligned only half of the
structure.
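
The top-1 retrieval criterion described above can be sketched as follows. The input format (a dictionary of diversity scores per representative group) and the function names are assumptions for this illustration; ties are resolved by taking the first minimum, which the text does not specify.

```python
def is_successful(query_group, scores):
    """scores: dict mapping representative group id -> diversity (lower is better)."""
    best_group = min(scores, key=scores.get)
    return best_group == query_group


def success_rate(queries):
    """queries: list of (query_group, scores) tuples. Returns a percentage."""
    hits = sum(is_successful(group, scores) for group, scores in queries)
    return 100.0 * hits / len(queries)


# Toy example with two queries and two representative groups.
toy = [("g1", {"g1": 0.1, "g2": 0.9}), ("g2", {"g1": 0.2, "g2": 0.6})]
print(success_rate(toy))

# Cross-check of the reported rates from the counts in Table 1.
print(round(100.0 * 2502 / 2577, 1), round(100.0 * 2447 / 2577, 1))
```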

Table 1 Retrieval performances of 2577 queries for different methods on the Hinge Atlas dataset

Method           Number of successful retrievals   Success rate (%)
LADed            2502                              97.1
LADgd            2447                              95.0
Matt-Flexible    2342                              90.9
FlexSnap         2329                              90.4
FASE             2282                              88.6
jFATCAT          2241                              87.0
FAST*            2234                              86.7
Matt-Rigid*      2185                              84.8
FlexProt         2167                              84.1

*Rigid alignment method.
The results are ordered by success rate and show that both LADed and LADgd outperform the other methods.

In addition to the retrieval success rate, we also evaluated performance on the Hinge Atlas
dataset using the precision-recall curve of 11-point interpolated average precision, a common
measure in information retrieval [65]. Note that here the 214 representatives were treated as
query structures individually, and each of them was compared with the remaining 2577 structures
in order to search for structures belonging to the same group. The precision is the fraction of
retrieved structures that are relevant to the query protein, and the recall is the fraction of
relevant structures that are successfully retrieved. Precision and recall are defined by the
following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

True positive (TP) is the number of successfully retrieved structures; false positive (FP) is
the number of incorrectly retrieved structures; false negative (FN) is the number of structures
belonging to the same group as the query but not retrieved. The interpolated precision at a
specific recall r is defined as the maximum precision over any recall r' ≥ r [65]. For each
query, a set of 11 interpolated precisions at 11 recall levels (0, 0.1, 0.2, ..., 1) was
determined, and the interpolated precisions of the 214 queries were then averaged at each
level. According to the precision-recall curves (see Figure 5), both LADed and LADgd
outperformed the other methods, since they had a larger area under the curve.
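
The 11-point interpolated precision for a single query can be sketched as below. The input is assumed to be the ranked retrieval list expressed as booleans (relevant or not), which is a representation chosen for this example.

```python
def eleven_point_interpolated(ranked_relevance, n_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query.

    ranked_relevance: list of bools in rank order (True = same group as query).
    n_relevant: total number of relevant structures for the query.
    """
    points = []  # (recall, precision) after each retrieved structure
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        points.append((hits / n_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision: maximum precision at any recall >= the level.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


curve = eleven_point_interpolated([True, False, True], n_relevant=2)
print([round(value, 3) for value in curve])
```

Averaging these 11 values level by level over all queries gives one curve per method.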

Figure 4 An example of successful retrieval but with poor structure alignments. The structure pair is from the morphing group va2eznA-1l5bA in
the Hinge Atlas dataset. (a) The open-form (green, ff9) and closed-form (blue, ff0) of Cyanovirin-N. (b) to (f) are structure alignments generated by
jFATCAT, FASE, FAST, Matt-Rigid and Matt-Flexible respectively. The non-aligned regions are colored gray. All methods ranked the closed-form of
Cyanovirin-N at the top of the 214 representative structures when the open-form of Cyanovirin-N was used as the query; nevertheless, FASE, FAST,
Matt-Rigid and Matt-Flexible aligned only half of the query protein.

Figure 5 Precision-recall curves of 11-point interpolated average precision for different methods on the Hinge Atlas dataset. The top
blue and orange curves represent LADed and LADgd respectively, and show that both LAD methods provide the best performance.

Table 2 Retrieval performances of 214 queries for different methods on the Hinge Atlas dataset

Method           Average R-precision (%)   Mean average precision (%)
LADed            95.54                     96.67
LADgd            93.53                     94.95
Matt-Flexible    87.55                     89.62
FlexSnap         86.97                     89.71
FASE             84.97                     87.40
jFATCAT          83.36                     86.23
FAST             82.81                     86.16
Matt-Rigid       82.24                     85.33
FlexProt         87.14                     89.98

R-Precision and Mean Average Precision (MAP) are two other common quantitative measures for
evaluating the overall performance of information retrieval systems. If there are R relevant
structures in total for a query, R-Precision is defined as the number of relevant structures
among the top R retrieved structures divided by R. For a query, Average Precision is the
average of the precisions obtained at the rank of each relevant structure. MAP is defined as
the mean of the Average Precisions over a set of queries. For more details on calculating these
measures, please refer to [65]. The average R-Precision and MAP of the 214 queries for the
different methods are shown in Table 2. The results show that both LADed and LADgd performed
better than the other methods, and LADed achieved an average R-Precision of 95.54% and a MAP of
96.67%.
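
R-Precision, Average Precision and MAP as defined in the text can be sketched in a few lines. As before, each ranked list is assumed to be given as booleans in rank order, and the function names are illustrative.

```python
def r_precision(ranked_relevance, n_relevant):
    """Fraction of relevant structures among the top R results (R = n_relevant)."""
    return sum(ranked_relevance[:n_relevant]) / n_relevant


def average_precision(ranked_relevance, n_relevant):
    """Mean of the precisions measured at the rank of each relevant structure."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant


def mean_average_precision(runs):
    """runs: list of (ranked_relevance, n_relevant) tuples, one per query."""
    return sum(average_precision(ranked, n) for ranked, n in runs) / len(runs)


runs = [([True, False, True], 2), ([False, True], 1)]
print(r_precision(*runs[0]), round(mean_average_precision(runs), 4))
```

Relevant structures that are never retrieved contribute zero to Average Precision because the sum is divided by the total number of relevant structures.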



Comparison with non-alignment methods 

Liu's dataset was employed to compare the LAD descriptor with non-alignment methods. To allow
comparison with the results in [36], only the top 64 retrieved structures for each query were
used to compute the precision and recall. The F1-measure is the harmonic mean of recall and
precision, defined as:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where the maximum value is 1. In contrast to the arithmetic mean, both precision and recall
need to be high to obtain a high F1-measure. The F1-measures of the compared methods are listed
in Table 3. LADed and LADgd achieved F1-measures of 43.27% and 43.18% respectively and
outperformed the other 7 non-alignment methods, the best of which reached an F1-measure of
37.04%.
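
A small sketch of the F1-measure at a fixed retrieval cutoff follows. It assumes that precision is computed over all retrieved structures up to the cutoff (64 in the text); how [36] handles this detail is not stated here, so treat it as an assumption.

```python
def f1_at_k(ranked_relevance, n_relevant, k=64):
    """F1-measure using only the top k retrieved structures.

    ranked_relevance: list of bools in rank order; n_relevant: relevant total.
    """
    top = ranked_relevance[:k]
    tp = sum(top)
    if tp == 0:
        return 0.0
    precision = tp / len(top)
    recall = tp / n_relevant
    return 2 * precision * recall / (precision + recall)


# Perfect ranking of a hypothetical group of 14 among 64 retrieved structures.
print(round(f1_at_k([True] * 14 + [False] * 50, n_relevant=14), 3))
```

Under this assumption, a fixed cutoff of 64 caps precision for small groups: with 382 structures in 27 groups (about 14 per group on average), even a perfect ranking yields an F1-measure well below 1, which may help put the absolute values in Table 3 into context.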

Discussion 

Self-connection problem 

Figure 6 shows an example of a bona fide domain swapping protein pair with a self-connection on
the surface caused by a large hinge bending motion. The difficulty is that a self-connection
leads to topology changes; hence the inner distance method, which considers all landmark
points, cannot solve this problem [35,36]. However, this type of deformation can be handled by
our proposed descriptor, especially by the LADed approach, since an LAD considers only local
geometric properties, which are not sensitive to global topology changes. Figure 6d and Figure
6e show a high consistency of the LADed and LADgd profiles, respectively, between the open-form
(PDB code: 1a2w, chain A) and closed-form (PDB code: 5rsa, chain A) of Ribonuclease A. LADed is
clearly more consistent than LADgd in this case, but both LADdiv values are close to zero,
representing highly similar conformations. The (RMSD, LADdiv) is (0.173, 0.0004) for LADed and
(0.454, 0.02) for LADgd.

Table 3 Comparison with non-alignment methods on Liu's dataset

Method                                F1-measure (%)
LADed                                 43.27
LADgd                                 43.18
Diffusion distance (DD)               37.04
Inner distance (ID)                   35.83
Shape distribution (SD)               28.40
Euclidean distance (ED)               28.81
Solid angle histogram (SAH)           25.69
Geodesic distance (GD)                26.42
Spherical harmonic descriptor (SHD)   23.93

The results are taken from [36] except for LADed and LADgd.

In general, LAD descriptors are insensitive to self-connection cases; however, an LADgd profile
is sometimes not consistent at the location of self-connecting regions. In another domain
swapping example, shown in Figure 7, an open-form Ribonuclease A (PDB code: 1js0, chain A)
changes to a closed-form (PDB code: 3di8, chain A). The swapped domain (yellow surface) bends
and intertwines with the protein body (blue surface) via conformational changes of the highly
flexible hinge loops (red surface) (see Figure 7a and Figure 7b). In Figure 7c, the LADed
clearly varies only slightly between the open- and closed-form states from residues H105 to
A109 (magenta rectangle). In contrast, the LADgd of the closed-form state is higher than that
of the open-form state in the corresponding highlighted region (see Figure 7d). As a detailed
illustration, consider a path from residue H105 to its +3 position (V108). When the swapped
domain is located apart from the protein body in the open-form state, the GD between these two
residues is the shortest path along the white surface. The GD and ED between the two residues
in the open-form state are 11.12 Å and 10.37 Å respectively. However, the path changes when the
swapped domain bends toward the body and intertwines with the white surface region, forming a
self-connection. The GD increases significantly because an additional mountain (yellow region
in Figure 7b) obstructs the original path from residue H105 to V108. The ED remains similar
since its path passes directly through the mountain instead of along the surface. The GD and ED
between the two residues in the closed-form state are 16.77 Å and 9.57 Å respectively. This
phenomenon is the main reason why the LADgd descriptor is more sensitive to topological changes
than LADed.

Differences between the previous and proposed ED/GD 
based methods 

In previous studies [34-36], ED and GD were shown to be sensitive to shape deformation and not
suitable for flexible molecular shape comparison. Interestingly, however, with the proposed LAD
methods both features become insensitive to topological changes and exhibit
deformation-invariant properties that address the flexibility problem. The reason ED and GD
features were sensitive in previous studies is that both distances were computed among all
global landmark points. In contrast, LAD characterizes the local geometric features of each
residue and its neighbouring residues. Therefore, ED and GD features become much less sensitive
to global topological changes.

Figure 6 An example of a self-connection case for the domain-swapping proteins 1a2w:A and 5rsa:A. The yellow part of open-form Ribonuclease A (PDB code: 1a2w, chain A) swaps toward the protein body to form a closed form (PDB code: 5rsa, chain A). (a) The structural alignment of 1a2w:A (green) and 5rsa:A (blue); the hinge loops are highlighted in red. The backbone surfaces of 1a2w:A (b) and 5rsa:A (c) are different due to domain swapping and self-connection. However, the LADed (d) and LADgd (e) profiles of both structures remain consistent.

Computational time

Pairwise comparison of LAD profiles was performed by a
modification of the Smith-Waterman algorithm and has the
same time complexity. The goal of a sequence alignment
problem is to identify the correspondence of residues
between two given proteins, while a structure alignment
emphasizes finding both an alignment and a spatial
superposition. The possible combinations of corresponding
residues are countable, while the possibilities of spatial
superposition are innumerable. Therefore, the computational
complexity of the proposed algorithm is inherently lower
than that of most commonly used structure alignment methods
[66]. The LAD algorithm was implemented in C# .NET and run
on an Intel Core i5-2500 3.3 GHz computer with 16 GB RAM.
For the 551478 pairwise comparisons mentioned in the Results
section, the average computational time was only 3.896 and
4.828 milliseconds per comparison for LADed and LADgd
profiles, respectively.
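
To make the complexity argument concrete, the following is a minimal
sketch of a Smith-Waterman-style local alignment applied to two
numeric profiles. It is not the authors' C# implementation or their
modification of the algorithm: the similarity term tol - |a - b|, the
linear gap penalty, and the name local_align_score are assumptions
made for illustration. The two nested loops show the O(mn) cost for
profiles of lengths m and n.

```python
def local_align_score(p, q, tol=1.0, gap=1.0):
    """Best Smith-Waterman local alignment score between numeric profiles p and q.

    Two values score tol - |a - b| when aligned (an assumed similarity
    function), and each gap costs `gap` (an assumed linear penalty).
    """
    prev = [0.0] * (len(q) + 1)  # previous row of the dynamic programming table
    best = 0.0
    for a in p:
        curr = [0.0]
        for j, b in enumerate(q, start=1):
            match = prev[j - 1] + (tol - abs(a - b))
            # Local alignment: a cell never drops below zero.
            score = max(0.0, match, prev[j] - gap, curr[j - 1] - gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical profiles: q contains p as an exact sub-profile.
profile_a = [5.0, 5.2, 6.1, 7.0]
profile_b = [9.0, 5.0, 5.2, 6.1, 7.0, 2.0]
print(local_align_score(profile_a, profile_b))
```

At the reported averages, the 551478 comparisons correspond to
roughly 36 minutes in total for LADed and roughly 44 minutes for
LADgd.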

Figure 7 Illustrating the variation between LADed and LADgd for a self-connection case. The 3D domain-swapped Ribonuclease A consists of a protein body (green/blue surface), a hinge loop (red surface) and a swapped domain (yellow surface). (a) The open-form Ribonuclease A (PDB code: 1js0, chain A). (b) The domain-swapped closed homolog of (a) (PDB code: 3di8, chain A). (c) and (d) are the LADed and LADgd profiles of both closed- and open-form structures, respectively. The red solid curve in (a) and (b) denotes a GD path, which is the shortest path along the surface (white surface region) connecting the two residues H105 and V108 (red spheres). The residues from H105 to A109 of both proteins are shown as magenta sticks and highlighted within a magenta box in (c) and (d). The black dashed line in (a) and (b) indicates the ED path between residues H105 and V108. Note that the magenta box shows that the LADgd profile is more sensitive at the topologically changed locations than the LADed profile.

Conclusions

We proposed a novel profile-based alignment method, named
LAD, for pairwise flexible protein structure comparison. It
can be constructed from any kind of spatial measure of local
neighbouring residues within a specific sliding window.
Here, GD and ED were used to build LADgd and LADed profiles.
The idea of LAD improves the ED- and GD-based descriptors,
which were previously shown to be sensitive to molecular
shape deformation, in particular to topological structural
changes. The effectiveness of the LAD descriptor has been
evaluated on two datasets of hinge bending motions from the
MolMovDB. Our methods are robust to deformed flexible
molecules and achieve good performance in assigning queries
to different classes of molecules with conformational
changes, and the results have shown superior performance
compared to existing alignment- and non-alignment-based
tools. Finally, the reasons why the LAD descriptor is
insensitive to self-connection in flexible proteins were
described using 3D domain swapping cases as examples, and
the reasons why LADed is more robust than LADgd were also
explained. The computational time required for pairwise
LADed/LADgd profile comparisons was analyzed to demonstrate
the feasibility of constructing an on-line structure
comparison system. The proposed descriptor is effective in
retrieving deformed proteins and could be an alternative
approach for database search, discovery of previously
unknown conformational relationships, and reorganization of
protein structure classification.
Availability of supporting data

The training and testing datasets for our method can be 
obtained from previously published papers by Chu CH 
[63] and Flores SC [37,38]. 